A negative binomial probability distribution was found to give good agreement
with empirical data distributions of speech intelligibility scores.

A  new  performance  rating  for  voice  communications  devices,  termed 
an  intelligibility  threshold  level  (ITL),  was  conceived  as  a  means  of  taking 
these  findings  into  consideration  in  establishing  a  measure  of  intelligibility 
performance  that  is  an  estimate  of  an  intelligibility  value  that  the  majority 
(rather  than  the  simple  average)  of  intelligibility  scores  for  a  voice  processor 
will  equal  or  exceed,  at  a  specified  confidence  level  established  in  relation 
to  the  sample  size  used  in  obtaining  the  rating. 

It  is  proposed  that  an  ITL  rating  is  a  more  meaningful  assessment  of  the 
degree  of  risk  involved  in  misunderstanding  voice  messages  or  causing  time 
to  be  lost  in  requiring  messages  to  be  repeated. 

It  was  shown  that  ITL's  can  be  determined  by  two  alternative  methods: 
by rank-ordering the intelligibility scores for a processor and constructing
the  cumulative  distribution  of  data  and  its  confidence  band,  or  by  using  a 
negative  binomial  probability  model  for  the  data  distribution. 

Chi-squared  tests  indicated  that  in  most  cases  the  negative  binomial 
probability  model  gave  a  reasonable  approximation  to  the  data  distribution. 

Intelligibility  threshold  levels  (ITL's)  estimated  with  the  negative 
binomial probability model differed by at most one quantum value (3.125)
from  the  ITL  values  determined  from  the  empirical  distributions. 

It  is  recommended  that  future  speech  intelligibility  tests  and  evaluations 
of  digital  voice  communications  processors  and  systems  include  an  analysis 
of  the  data  to  determine  the  80  percent  ITL's  at  95  percent  probability, 
that is, the intelligibility level that 80 percent of the population of
intelligibility scores (for individual speakers and phonetic features) will
equal or exceed with 95 percent probability.
Intelligibility Threshold Level (ITL)
Ratings  of  Some  Current  Digital 
Voice  Communications  Processors 


1. INTRODUCTION

Over  the  past  several  years  there  has  been  an  opportunity  to  study  speech 
intelligibility  data  obtained  from  tests  and  evaluations  of  a  wide  variety  of  proc¬ 
essors  for  digital  speech  communications  applications.  These  studies  have  led  to 
a  conclusion  that  average  intelligibility  scores  currently  used  to  specify  intelligi¬ 
bility  performance  fall  short  of  providing  adequate  ratings  of  intelligibility,  and 
that  a  need  exists  for  an  alternative  rating  for  speech  intelligibility  that  takes  into 
consideration  dispersion  and  highly  skewed  distributions  of  scores  that  character¬ 
ize  data  obtained  with  multiple  speakers  and  diagnostic  intelligibility  testing.  This 
report  presents  some  of  those  findings  and  presents  a  new  concept  for  a  speech 
intelligibility  rating  to  supplement  or  replace  average  scores  that  rate  the  intelligi¬ 
bility  performance  of  speech  communications  systems  and  devices. 

1.1 Diagnostic Rhyme Test (DRT)

The  intelligibility  performance  of  speech  communications  processors  is  for 
the  most  part  evaluated  with  the  Diagnostic  Rhyme  Test  or  DRT.  This  is  a  test 
that grew out of research on very-low-data-rate digital speech communications by
the  method  of  voice  pattern-matching.  A  speech  intelligibility  test  method  was 
needed  to  provide  diagnostic  intelligibility  data,  that  is  to  assess  separately  the 

(Received  for  publication  17  August  1979) 
intelligibility obtained for each of the categories of phonetic events, as well as
assessing  overall  or  "total"  intelligibility.  This  need  was  critical  for  evaluating 
performance  and  guiding  the  research  in  connection  with  the  voice  pattern-coding 
technique,  as  it  was  considered  essential  that  the  library  of  spectral  patterns  of 
speech  should  fully  accommodate  the  range  of  allophones  of  conversational  speech. 

A  diagnostic  intelligibility  test  method  was  required  in  order  to  assess  whether 
this  objective  had  been  met  and  to  identify  any  deficiencies  in  analysis  and  synthesis 
of  the  various  speech  sounds.  The  Diagnostic  Rhyme  Test  or  DRT  was  developed 
to  fulfill  this  need.  It  quickly  became  apparent  that  the  DRT  combined  unique 
properties  of  resolving  power  and  sensitivity  for  assessing  speech  intelligibility 
performance,  as  well  as  being  extremely  efficient  and  economical  for  obtaining 
detailed  intelligibility  analyses  with  minimum  investments  in  processing  time  and 
listener  crew  time.  An  initial  single-speaker  version  of  the  DRT  which  was  used 
in  a  first  survey  of  the  intelligibility  performance  of  vocoder  technology  in  1967 
was  subsequently  expanded  to  a  multiple-speaker  version  that  was  used  in  a  second 
survey  of  intelligibility  performance  of  vocoder  technology  in  1972.  Since  that 
time  additional  multiple-speaker  versions  of  the  DRT  including  men  and  women 
speakers have been recorded with close-talking dynamic, carbon, and pressure-gradient
(noise-cancelling) microphones; recording conditions have included quiet
environments  and  talkers  in  various  simulations  of  acoustic  noise  environments  of 
interest  in  the  Department  of  Defense  (DOD).  These  intelligibility  test  recordings 
have  been  widely  used  in  the  DOD  for  evaluating  intelligibility  of  a  variety  of  digital 
and  analog  speech  processing  techniques  and  hardware.  Intelligibility  data  obtained 
in  these  tests  has  provided  the  opportunity  to  closely  examine  such  questions  as 
intelligibility  differences  found  among  different  speakers  and  variations  in  intelligi¬ 
bility  of  the  various  phonetic  features. 

1.2  Speaker  Variability 

A  general  finding  has  been  that  intelligibility  scores  are  typically  character¬ 
ized  by  significant  differences  among  speakers.  Examples  of  this  effect  are  pre¬ 
sented  in  Figures  1  and  2  which  present  rankings  of  nine  speakers  (six  male  and 
three  female)  obtained  from  tests  of  three  categories  of  voice  processors,  con¬ 
sisting  of  approximately  a  dozen  each  of  narrowband,  mediumband,  and  wideband 
devices.  The  speaker  scores  presented  are  the  averages  for  all  processors  of 
each  group.  The  Newman-Keuls  test  of  significant  differences  among  means  was 
utilized  in  determining  significant  differences  among  the  mean  scores,  leading  to 
the  rankings  shown  by  the  brackets.  Data  from  the  tests  with  the  high-quality 
dynamic  microphone  indicated  that  the  speakers  fell  into  four  categories  which 
overlapped  except  in  the  case  of  the  narrowband  processors.  Data  obtained  with 
The Newman-Keuls test of differences between means indicated
that scores within brackets did not differ significantly.


Figure  1.  Mean  Speaker  Scores  Obtained  with  a  Dynamic  Microphone 
The Newman-Keuls test of differences between means indicated
that scores within brackets did not differ significantly.


Figure  2.  Mean  Speaker  Scores  Obtained  with  a  Carbon  Microphone 
the carbon microphone had the effect of accentuating the significant differences
among  the  speakers  as  shown  in  Figure  2. 

The  data  serve  to  highlight  the  necessity  for  standardizing  the  speakers  used 
in  intelligibility  testing.  They  leave  unanswered  the  question  of  the  degree  to  which 
differences  in  intelligibility  performance  might  occur  in  a  large  and  diverse  group 
of  speakers  that  might  be  involved  in  using  a  military  voice  communications  sys¬ 
tem.  These  tests  represented  a  very  limited  sample  of  speakers  in  terms  of 
factors  such  as  age,  regional  dialects,  fundamental  pitch,  and  other  factors  that 
can  be  involved  in  speech  quality.  Considerable  research  remains  to  be  done 
before  it  can  be  determined  whether  tests  with  a  small  group  of  speakers  such  as 
these  provide  intelligibility  performance  data  that  would  be  typical  of  a  much 
larger  population  of  speakers. 

Variations  in  intelligibility  performance  of  different  speakers  are  presented 
in  other  contexts  as  shown  in  Figures  3  and  4,  which  were  obtained  from  tests  of 
a  linear-predictive  vocoder  algorithm  operating  at  2400  bits  per  second  (BPS). 
Figure 3 summarizes the effects of random bit errors superimposed on the data
stream; linear regression equations were determined for the relation between
overall intelligibility and the bit error rate for six male speakers. The
intelligibility performance obtained for the different speakers tended to diverge as the bit
error rate increased; analysis of variance confirmed that differences between
slopes of the regression lines for the different speakers were highly significant
(α = 0.001). The results emphasize the hazards of using the trend in average
intelligibility  scores  (across  all  speakers)  for  predicting  intelligibility  expected 
for  any  one  speaker. 

In  Figure  4  the  degradation  of  intelligibility  caused  by  the  speakers  being 
located  in  a  noise  environment  is  summarized.  In  this  experiment  the  effects  of 
the  ambient  noise  in  the  cabin  of  a  jet  aircraft  were  simulated  by  electrically  mix¬ 
ing  a  recording  of  noise  measured  in  the  aircraft,  with  the  recorded  speech  signal, 
prior  to  processing  the  speech  with  the  vocoder  algorithm.  Intelligibility  tests 
were  performed  at  various  signal-to-noise  ratios  and  the  resulting  intelligibility 
data were analyzed to determine second-order regression models relating the overall
intelligibility  and  the  signal-to-noise  ratio.  These  data  need  to  be  interpreted  with 
caution,  since  the  procedure  of  electrically  mixing  the  noise  signal  does  not  accu¬ 
rately  encompass  two  significant  effects  that  would  be  present  in  a  "real  world" 
situation:  a  speaker  would  alter  his  performance  in  order  to  try  to  compensate  for 
the  effect  of  the  noise,  and  the  frequency  spectrum  of  the  interfering  noise  would 
undergo  changes  due  to  the  response  of  the  pressure  gradient  microphone  to  a  far- 
field  source  of  sound.  While  these  effects  would  probably  alter  the  values  shown 
for  the  regression  coefficients,  the  results  show  significant  differences  in  intelligi¬ 
bility  for  the  six  speakers,  differences  that  are  obscured  in  a  single,  overall 
intelligibility  score. 
Figure 3. Total Intelligibility as a Function of Bit
Error Rate for an LPC Vocoder at 2400 BPS (Six
Male Speakers)


Figure 4. Total Intelligibility as a Function of
Speech-to-Noise Ratio in Jet Aircraft Cabin
Noise for an LPC Vocoder at 2400 BPS (Six
Male Speakers)
Significant differences in intelligibility attained by different speakers, and
significant speaker-processor interactions, have been found to be the rule rather
than the exception in speech intelligibility performance.

1.3  Phonetic  Feature  Variability 

It  has  been  shown  by  examples  how  large  differences  are  found  among  speaker 
intelligibility  scores.  Even  larger  variations  have  been  found  to  commonly  occur 
among  scores  for  the  various  phonetic  features.  Examples  are  presented  in  Fig¬ 
ures  5  and  6  showing  phonetic  feature  scores  obtained  in  intelligibility  perform¬ 
ance  of  a  linear-predictive  vocoder  (LPC)  operating  at  2400  bits  per  second.  For 
these  examples,  separate  trends  are  shown  for  the  six  primary  phonetic  features 
tested  with  the  Diagnostic  Rhyme  Test:  Voicing,  Nasality,  Sustention,  Sibilation, 
Graveness,  and  Compactness.  (It  will  be  shown  later  in  this  report  that  each  of 
these  features  is  further  subdivided  by  the  test  into  four  feature  states,  among 
which  even  larger  variations  occur.) 

In  Figure  5  linear  regression  lines  are  shown  that  represent  the  variation  in 
intelligibility  score  for  each  of  these  features,  in  relation  to  the  bit  error  rate. 

The  spread  of  the  regression  lines  serves  to  indicate  how  much  deviation  is 
Figure 5. Phonetic Feature Intelligibility as a
Function  of  Bit  Error  Rate  for  an  LPC  Vocoder 
at  2400  BPS 
Figure 6. Phonetic Feature Intelligibility as a
Function of Speech-to-Noise Ratio in Jet Aircraft
Cabin Noise for an LPC Vocoder at 2400 BPS


possible with regard to the average or "total" intelligibility variation for the
ensemble, as is the case in Figure 6 showing second-order regression curves for
phonetic feature intelligibility in relation to the speech-to-noise ratio estimate for
speech in simulated jet aircraft cabin noise.


2. INTELLIGIBILITY THRESHOLD LEVEL (ITL) RATING

It  has  been  shown  that  typically  the  population  of  intelligibility  scores,  the 
aggregate  of  scores  representing  the  intelligibility  of  individual  speaker  phonetic 
feature  combinations,  involves  considerable  dispersion.  Perhaps  because  there 
is  so  much  detailed  information  from  a  single  multi-speaker  intelligibility  test,  it 
has  been  the  practice  to  condense  the  results  to  the  average  score  and  its  standard 
error. 

The proposed ITL rating involves an interpretation based on the distribution
of  scores  obtained  in  a  test.  From  analysis  of  the  distribution,  the  ITL  rating 
estimates  the  intelligibility  value  that  will  be  equaled  or  exceeded  by  a  specified 
percentage  of  the  population  of  scores,  at  a  specified  confidence  level.  For 
example,  the  distribution  of  scores  might  lead  to  an  ITL  estimate  that  there  is 
95 percent probability that 80 percent of the scores (for individual speakers and
phonetic features) will equal or exceed 70. (This ITL value has been observed in
connection with average intelligibility scores around 90.)

2.1 Rationale for the ITL Rating

The  proposed  ITL  rating  derives  from  inspection  of  many  distributions  of 
intelligibility  scores  and  observations  of  the  large  differences  in  speaker  and 
phonetic  feature  scores  that  have  been  cited.  It  also  derives  from  a  premise  that 
in  critical  military  voice  communications  (and  perhaps  in  other  critical  speech 
communications  such  as  air  traffic  control)  an  intelligibility  assessment  is  required 
that  is  made  in  terms  of  the  level  of  performance  to  be  expected  for  the  majority 
rather  than  for  the  average:  the  majority  of  the  speech  events,  and  the  majority 
of  the  speakers  and  listeners. 

The  average  intelligibility  scores  currently  used  for  specifying  performance 
could  be  expected  to  state  a  value  equaled  or  exceeded  by  half  the  underlying  pop¬ 
ulation  of  scores  (for  speakers  and  phonetic  features)  assuming  that  the  scores 
were  normally  distributed. 

However,  it  has  been  found  that  the  populations  of  scores  from  diagnostic 
intelligibility  tests  typically  are  not  normally  distributed,  but  highly  skewed;  the 
higher  the  average  score,  the  greater  the  extent  to  which  the  distribution  of  scores 
tends  to  be  skewed. 

The  Lilliefors  test  for  conformity  with  a  normal  distribution  has  been  applied 
to  many  distributions  of  intelligibility  data.  Where  the  sample  population  is  made 
up  of  the  phonetic  feature  scores,  whether  of  one  or  several  speakers,  the  null 
hypothesis  (for  conformity  with  the  normal  distribution)  has  invariably  been 
rejected. 
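
The kind of normality check described above can be sketched as follows. This is a minimal illustration, not the procedure used in the report: the function names are hypothetical, the error counts are invented for illustration only, and the 5 percent critical value 0.886/sqrt(n) is the commonly tabulated large-sample approximation for the Lilliefors test (sample sizes above about 30).

```python
import math

def lilliefors_statistic(scores):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    gap = 0.0
    for i, x in enumerate(sorted(scores)):
        fitted = 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
        # Compare the fitted CDF with the empirical step just below and at x.
        gap = max(gap, abs((i + 1) / n - fitted), abs(fitted - i / n))
    return gap

def rejects_normality(scores):
    """Assumed large-sample 5 percent rule: reject if D exceeds 0.886 / sqrt(n)."""
    return lilliefors_statistic(scores) > 0.886 / math.sqrt(len(scores))

# Illustrative (not measured) counts of feature-state scores by number of
# listener errors: many perfect scores and a long lower tail.
error_counts = {0: 80, 1: 35, 2: 22, 3: 16, 4: 12, 5: 9,
                6: 8, 8: 7, 10: 6, 13: 5, 16: 4, 20: 3}
scores = [100.0 - 3.125 * errors
          for errors, count in error_counts.items()
          for _ in range(count)]
print(rejects_normality(scores))
```

A sample piled up against the maximum score of 100, as in this invented example, leaves a large gap between the empirical distribution and any fitted normal ogive, which is the pattern the text attributes to diagnostic intelligibility data.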

2.2  Examples  of  Distributions  of  Speech  Intelligibility  Data 

An  example  of  a  DRT  data  summary  is  shown  in  Figure  7  consisting  of 
phonetic  feature  scores  for  a  single  speaker  (CH)  obtained  with  a  test  of  contin¬ 
uous-variable-slope  delta  modulation  (CVSD)  at  16  kilobits  per  second.  The  num¬ 
bers  of  listener  errors  (based  on  evaluation  with  a  crew  of  eight  listeners)  that 
provided the basis for each score are shown in parentheses after each score.

(A  multi-speaker  intelligibility  test  results  in  a  set  of  scores  such  as  this  from 
each speaker of the test, plus an overall summary listing.)

The  listing  of  Figure  7  indicates  how  each  of  the  six  features  is  tested  in  terms 
of four feature states; that is, the feature Voicing involves separate, independent
assessments  of  Voicing  Present  (Frictional),  Voicing  Present  (Non-Frictional), 
Voicing  Absent  (Frictional),  and  Voicing  Absent  (Non-Frictional),  these  being 
Figure 7. Summary of Diagnostic Rhyme Test (DRT) Intelligibility
Scores Obtained for Speaker CH with CVSD at
16 kbps


averaged  four  ways  to  obtain  scores  for  the  effects  of  voicing  being  present  and 
absent,  and  for  frictional  and  non-frictional  effects  of  voicing.  An  overall  average 
estimates the total intelligibility for voicing. These details help diagnose specific
deficiencies  and  help  in  identifying  possible  causes  and  remedies. 

The  24  scores  for  the  four  states  of  each  of  the  six  features,  shown  within 
dotted  boxes  in  Figure  7,  portray  the  fine  details  of  intelligibility  performance  of 
a  processor.  It  is  among  these  details  that  the  differences  in  voice  processor  per¬ 
formance  are  usually  identified,  and  deficiencies  are  highlighted. 

When the 24 feature-state intelligibility scores are rank-ordered and plotted
in the form of a cumulative distribution, a plot of the type shown in Figure 8
results. Here each of the 24 scores is represented by a vertical line segment
representing 1/24th of the data population and the ends of adjacent segments have
been  connected  to  form  a  cumulative  distribution  starting  with  the  lowest  score 
which  in  this  example  was  for  Graveness  Absent  (Unvoiced),  the  next  lowest  score 
which  was  for  the  feature  state  Graveness  Present  (Unvoiced),  etc.  Across  the 
top of the figure is shown a scale corresponding to the total number of listener
errors associated with each score. A normal ogive is also shown corresponding
to the values of the mean (92.1) and the standard deviation (12.4) associated with
Figure 8. Distribution of Phonetic Feature Scores
Obtained for Speaker CH with CVSD at 16 kbps

this  group  of  data.  The  plot  indicates  that  80  percent  of  the  actual  data  equaled  or 
exceeded  a  score  of  84.4.  A  one-sided  95  percent  confidence  band  for  the  data 
distribution  was  constructed  by  the  method  of  Kolmogorov,  shown  as  a  dashed  pro¬ 
file.  The  confidence  limit  associated  with  the  cumulative  data  distribution  does 
not  intersect  the  dotted  horizontal  line  defining  80  percent  of  the  data  sample;  thus 
we are unable to make a statement about an estimated 80 percent ITL at p = 0.95.

This  example  illustrates  an  important  feature  of  the  ITL  rating:  it  is 
assessed  with  respect  to  a  confidence  limit,  taking  into  consideration  the  size  of 
the  sample  used  in  arriving  at  the  estimate.  Data  from  a  single  speaker  did  not 
provide  an  adequate  sample  size  for  estimating  the  value  for  an  80  percent  ITL  at 
p = 0.95.
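
The empirical determination of an ITL described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name is hypothetical, the input values are placeholders rather than measured scores, and the one-sided band half-width is taken from the asymptotic Kolmogorov-Smirnov formula sqrt(-ln(1 - confidence) / (2n)) rather than from the exact small-sample tables that the report may have used.

```python
import bisect
import math

def empirical_itl(scores, fraction=0.80, confidence=0.95):
    """Highest observed score that `fraction` of the population is estimated
    to equal or exceed at the given confidence, or None if the sample is too
    small to support such a statement."""
    n = len(scores)
    # Assumed asymptotic one-sided confidence band half-width.
    half_width = math.sqrt(-math.log(1.0 - confidence) / (2.0 * n))
    allowed_below = 1.0 - fraction
    ordered = sorted(scores)
    itl = None
    for threshold in sorted(set(ordered)):
        # Empirical proportion of scores strictly below the candidate threshold.
        below = bisect.bisect_left(ordered, threshold) / n
        if below + half_width <= allowed_below:
            itl = threshold
        else:
            break
    return itl

print(empirical_itl(list(range(1, 25))))    # 24 placeholder scores
print(empirical_itl(list(range(1, 201))))   # 200 placeholder scores
```

With 24 scores the assumed half-width is about 0.25, which exceeds the 20 percent of the population allowed to fall below the threshold, so no ITL can be stated; this is consistent with the single-speaker case discussed above. With 216 scores the half-width falls to about 0.083.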

Also  shown  in  Figure  8  are  points  representing  values  that  result  from  a  neg¬ 
ative  binomial  probability  distribution  model  for  the  distribution  of  listener 
responses  in  terms  of  listener  errors,  rather  than  conventional  DRT  scores.  In 
these  studies  of  intelligibility  data  distributions,  comparisons  were  made  with  the 
normal,  binomial,  Poisson,  and  negative  binomial  forms.  Of  these,  only  the 
negative  binomial  probability  distribution  was  found  to  provide  a  reasonable 
approximation  to  the  speech  intelligibility  data  distributions. 
2.3 Negative Binomial Probability Distribution Model

The negative binomial probability model is summarized in Figure 9. The distribution
is defined by two parameters, consisting of the mean value m (the mean
number of listener errors that established feature-state scores) and a parameter k
for which an estimate can be made based on the mean and variance of the data
population (in units of "numbers of errors"). These values are obtained by a simple
transformation on the mean and variance values obtained from conventional scores.
No adequate tables of values of the negative binomial probability distribution have
been found in the literature; however, values can be readily calculated with a programmable
calculator. A program for calculating negative binomial values with a
Wang Model 720 Calculator is presented in Appendix B.
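
Such values can also be computed in a few lines of a modern language. The sketch below is not the Appendix B program; it assumes the standard mean-and-shape form of the negative binomial distribution, P(x) = Γ(k + x) / (x! Γ(k)) · (m / (m + k))^x · (k / (m + k))^k, which may differ in notation from Figure 9. The function name and the truncation at 32 errors are illustrative choices.

```python
def neg_binomial_pmf(max_errors, m, k):
    """Probabilities P(X = 0), ..., P(X = max_errors) for mean m and shape k."""
    ratio = m / (m + k)
    probs = [(k / (m + k)) ** k]                 # P(X = 0)
    for x in range(max_errors):
        # Recurrence: P(x + 1) = P(x) * (k + x) / (x + 1) * m / (m + k)
        probs.append(probs[-1] * (k + x) / (x + 1) * ratio)
    return probs

# Parameters quoted for CVSD at 16 kbps (Table 1).
probs = neg_binomial_pmf(32, m=3.116, k=0.4962)
print(round(probs[0], 3), round(sum(probs), 3))
```

The recurrence avoids evaluating gamma functions directly, which is convenient when k is not an integer, as with the fractional values of k quoted in this report.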

Figure  10  presents  the  distribution  of  scores  for  the  feature  states  obtained 
from  testing  CVSD  at  16  kbps,  in  which  the  data  for  Speaker  CH  presented  in 
Figures  7  and  8  has  been  combined  with  data  for  eight  additional  speakers.  Shown 
for  comparison  is  a  normal  ogive  based  on  the  mean  score  (90.3)  and  the  standard 
deviation (14.9) for this sample population. These values equate to m = 3.116,
representing the average number of listener errors per feature state, and to a
value of k = 0.496, representing the parameters for a negative binomial probability
distribution  model  for  which  points  are  shown  plotted  in  comparison  with  the  data 
distribution. 
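
The transformation from score units to error units can be sketched as follows. The sketch assumes that each listener error lowers a feature-state score by one quantum of 3.125 points and that k is estimated by the method of moments from variance = m + m²/k; the function name is hypothetical. Under these assumptions the rounded mean and standard deviation quoted above give values close to the reported m and k.

```python
QUANTUM = 3.125  # assumed score change per listener error

def score_stats_to_error_params(mean_score, sd_score, quantum=QUANTUM):
    """Convert the mean and standard deviation of DRT scores into negative
    binomial parameters (m, k) expressed in numbers of listener errors."""
    m = (100.0 - mean_score) / quantum        # mean errors per feature state
    var_errors = (sd_score / quantum) ** 2    # variance in error units
    k = m ** 2 / (var_errors - m)             # from variance = m + m**2 / k
    return m, k

m, k = score_stats_to_error_params(90.3, 14.9)
print(round(m, 2), round(k, 2))
```

The small differences from the tabulated parameters are of the size expected from using a mean and standard deviation rounded to one decimal place.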

It can be observed from the figure that 80 percent of the actual data population
equaled or exceeded a value of 81.3. An ITL is shown in relation to the confidence
band for the data distribution: it discloses an estimate that at p = 0.95, 80 percent
of the data population will equal or exceed a score of 71.9. The difference between
the  ITL  value  and  the  actual  80  percent  data  value  takes  into  consideration  the 
sample  size  that  was  the  basis  for  the  determination. 
Figure 10. Distribution of Phonetic Feature Intelligibility
Scores Obtained for Nine Speakers, with CVSD
Processing at 16 kbps

This  determination  of  the  ITL  and  the  resulting  value  obviously  do  not  take 
into  account  the  identity  of  the  "bad"  intelligibility  scores  that  made  up  the  bottom 
20  percent  of  the  distribution.  On  the  other  hand,  in  critical  voice  communications, 
poor  intelligibility  performance  for  any  of  the  phonetic  features  in  combination 
with  any  of  the  speakers  presents  a  risk  in  terms  of  possible  consequences  because 
of  messages  being  misunderstood,  or  time  lost  while  messages  are  repeated.  An 
ITL rating provides a more meaningful assessment of the degree of that risk than
a  conventional  score  specifying  average  intelligibility  performance.  The  distribu¬ 
tions  of  scores  reveal  that  voice  systems  can  have  relatively  high  average  scores 
and  still  have  a  significant  proportion  of  low  scores;  the  ITL  focuses  on  the  critical 
low  20  percent  (in  these  examples)  and  provides  information  about  the  lower  tail  of 
the  distribution  of  scores. 

If the identity of "bad" scores is required for the purpose of guidance for research
in improving intelligibility performance, or for comparing different processors,
that information is readily available from conventional listings of intelligibility
data.

It  can  be  observed  in  Figure  10  that  the  negative  binomial  model  gave  good 
agreement  with  the  actual  data  distribution.  The  data  frequencies  are  compared 
with theoretical frequencies based on the negative binomial model in Table 1. A
chi-squared test comparing the data and the model resulted in a value of p = 0.395;
the null hypothesis (for conformity with a negative binomial model) was not rejected
for this data obtained from testing CVSD at 16 kbps.
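
A comparison of this kind can be sketched as follows. The report does not state how the frequency cells were grouped, so the sketch assumes the common convention of pooling adjacent cells until each has an expected count of at least five; the function name and that threshold are illustrative. The function returns the Pearson chi-squared statistic and the number of pooled cells; with two parameters (m and k) estimated from the data, the degrees of freedom would be the number of cells minus three, and converting the statistic to a value of p requires the chi-squared distribution function, which is not included here.

```python
def chi_squared_statistic(observed, expected_probs, min_expected=5.0):
    """Pearson chi-squared statistic with adjacent cells pooled.

    observed[i] is the number of scores with i listener errors and
    expected_probs[i] is the model probability of i errors.
    Returns (statistic, number_of_pooled_cells).
    """
    n = sum(observed)
    cells = []            # each entry is [observed_count, expected_count]
    obs_acc = 0.0
    exp_acc = 0.0
    for count, prob in zip(observed, expected_probs):
        obs_acc += count
        exp_acc += n * prob
        if exp_acc >= min_expected:
            cells.append([obs_acc, exp_acc])
            obs_acc = 0.0
            exp_acc = 0.0
    # Model probability beyond the listed cells belongs to the upper tail.
    exp_acc += n * max(0.0, 1.0 - sum(expected_probs))
    if cells:
        cells[-1][0] += obs_acc   # fold any leftover tail into the last cell
        cells[-1][1] += exp_acc
    else:
        cells.append([obs_acc, exp_acc])
    statistic = sum((o - e) ** 2 / e for o, e in cells)
    return statistic, len(cells)

# Placeholder frequencies, for illustration only.
print(chi_squared_statistic([30, 10], [0.5, 0.5]))
```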

In Figure 11 the distribution of scores obtained from testing CVSD at 32 kbps
is presented together with the 95 percent upper confidence limit for the empirical
data distribution. The data indicate that there is 0.95 probability that 80 percent
of the population of intelligibility scores will equal or exceed 87.5. (In the actual
data, the threshold was 93.8.) The negative binomial probability model based on
the parameters observed in connection with this data led to an identical ITL estimate:
87.5. The data frequencies are compared with the theoretical negative
binomial model in Table 2; a chi-squared test comparing the data with the model
resulted in a value of p = 0.44.

A  third  example  of  the  distribution  of  intelligibility  scores,  from  a  test  with 
nine  speakers,  is  presented  in  Figure  12,  representing  a  summary  of  a  test  of 
CVSD operating at 9.6 kbps. In this case the ITL was considerably lower than the
two previous examples: it is estimated that at a confidence level of 0.95, 80 percent
of the population of scores will equal or exceed 53.1. When the average
intelligibility scores are compared for 16 kbps and 9.6 kbps CVSD, the difference
was 10.2 points (90.3 vs 80.1). However, the difference in ITL values was 18.8
points (71.9 vs 53.1).

The  data  frequencies  are  compared  with  the  theoretical  negative  binomial 
model based on the parameters derived from the data for 9.6 kbps CVSD in Table 3.
Again in this case, a value of p was obtained indicating that the negative binomial
probability  distribution  gave  a  reasonable  approximation  to  the  data  distribution. 

As  with  the  previous  examples,  the  negative  binomial  model  led  to  an  identical  ITL 
value (53.1) as obtained with the empirical data distribution.

Intelligibility  data  frequencies  obtained  from  testing  an  LPC-10  vocoder 
algorithm  (LPC-23*)  operating  at  2400  BPS,  with  six  male  speakers  and  two  inde¬ 
pendent  presentations  to  the  listener  crew  are  summarized  in  the  distribution  of 
scores  presented  in  Figure  13.  Also  shown  are  the  normal  and  negative  binomial 
forms  based  on  the  parameters  for  this  distribution  and  the  upper  one-sided 
95  percent  confidence  belt  for  the  empirical  data  distribution.  From  the  confidence 
limit,  an  ITL  estimate  for  this  processor  was  obtained:  there  is  a  95  percent 
probability  that  80  percent  of  the  population  of  intelligibility  scores  will  equal  or 
exceed 75.0.

Table  4  compares  the  data  frequencies  and  the  theoretical  negative  binomial 
frequencies  based  on  the  parameters  calculated  for  this  set  of  data.  The  value  of 
chi-squared and corresponding value of p = 0.098 indicate that the null
hypothesis for conformity of the data distribution with the negative binomial form should not be rejected.
Table 1. Comparison of Intelligibility Data Frequencies and Negative Binomial Probability Distribution
(m = 3.116, k = 0.4962, N = 216)
Table 2. Comparison of Intelligibility Data Frequencies and Negative Binomial Probability Distribution
(m = 1.598, k = 0.3962, N = 216)
Table 3. Comparison of Intelligibility Data Frequencies and Negative Binomial Probability Distribution
(m = 6.366, k = 0.8676, N = 216)
CVSD at 9.6 kbps. Mean DRT Score: 80.11 S.D.: 22.77 N = 216 (6 Male & 3 Female Speakers)
Figure 11. Distribution of Phonetic Feature Intelligibility
Scores Obtained for Nine Speakers with CVSD
Processing at 32 kbps
Figure 12. Distribution of Phonetic Feature Intelligibility
Scores Obtained for Nine Speakers with CVSD
Processing at 9.6 kbps
Figure 13. Distribution of Phonetic Feature Intelligibility
Scores Obtained for Six Speakers (Two Presentations)
with an LPC Vocoder

The intelligibility threshold level (ITL) estimate calculated with the theoretical
cumulative probability associated with the negative binomial model resulted in
a value (75.0) identical to that obtained from the empirical data distribution.

An  example  based  on  intelligibility  data  obtained  with  another  type  of  voice 
processor,  a  sixteen-channel  vocoder  operating  at  2400  BPS  tested  with  six  male 
and three female speakers, is presented in Figure 14. Comparing this with the data
distribution for the LPC-10 vocoder algorithm (Figure 13) illustrates again that
differences in ITL ratings tend to be greater than those observed for mean
intelligibility scores. The channel vocoder, with a mean score of 83.0, was approximately
six  points  lower  than  the  mean  intelligibility  score  obtained  for  the  LPC-10  vocoder 
algorithm.  However,  the  difference  in  ITL  ratings  for  the  two  processors  was 
more  than  15  points  (75.0  vs.  59.4). 

The values associated with the data and the negative binomial model are listed
in Table 5, together with the results of a chi-squared test comparing the empirical
data frequencies and the negative binomial model. In this case, the value of p was
0.025, and the null hypothesis was rejected. Although the data represented a poor
fit to the negative binomial form, the ITL calculated from the negative binomial
model differed by only one quantum level in the data (ITL of 62.5 from the model
vs. 59.4 from the actual data distribution).

Table 5. Comparison of Intelligibility Data Frequencies and Negative Binomial Probability Distribution
(m = 5.455, k = 0.9910, N = 216)
Sixteen-Channel Vocoder at 2400 BPS. Mean DRT Score: 82.93  S.D.: 18.62
N = 216 (Six Male & Three Female Speakers)

[Figure 14 horizontal axis: Ranked DRT Intelligibility Score (Individual Speakers by Features)]

Figure 14. Distribution of Phonetic Feature Intelligibility Scores Obtained for
Nine Speakers with a Sixteen-Channel Vocoder at 2400 BPS


2.4  Results of Chi-Squared Tests for Conformity of Intelligibility Data
Distributions with Negative Binomial Probability Model

To further test conformity of intelligibility data distributions with a negative
binomial probability model, a series of chi-squared tests was performed comparing
empirical data distributions with the negative binomial probability model
on 81 sets of speech intelligibility data on hand from prior tests and evaluations
of various speech processor algorithms and hardware. These tests included data
from the combinations of speakers and processors summarized in Table 6, ranging
from data for a single speaker (both male and female speakers) to as many as
twelve male speakers, the processors including LPC and channel vocoders and
CVSD at three data rates. Intelligibility data were analyzed separately for voiced
and unvoiced feature data as well as total data summaries. The results of this
exploratory study are summarized in Figure 15, which presents the distribution of
the values of probability based on the values of chi-squared in conjunction with the
degrees of freedom. Almost one-fourth of the 81 cases resulted in values of p
less than 0.05, for which the null hypothesis would be rejected. Thus the agreement
with the negative binomial probability model was far from perfect.

Table 6. Summary of Intelligibility Data Sets Tested for Conformity
with the Negative Binomial Probability Model

DRT Intelligibility Data Tested for Conformity
with the Negative Binomial Distribution

    Single Speaker (M; Fem)
    Three Speakers (3 M; 3 Fem)
    Six M. Speakers (two versions)
    Nine Speakers (6 M, 3 Fem)
    Twelve M. Speakers

    LPC Vocoders; Channel Vocoders; APC; CVSD (3 rates); Hybrid Vocoders

    Nr of Listeners: Eight

    Data Sets: Feature scores (by individual speakers)
                 - Voiced features
                 - Unvoiced features
                 - All features
               Total DRT scores (by listeners/speakers)

    Total Data Groups Used for Chi-Squared Tests: 81

Figure 15. Summary of Chi-Squared Tests for Conformity of Speech Intelligibility
Data Distributions with a Negative Binomial Probability Model

Many of the data sets represented composite results from two separate presentations to
the listener crews. It was subsequently found that a curious trend was present in
the data sets in which the composite data (for two or more presentations to listeners)
showed poorer agreement with the model: data from the first presentation
to listeners agreed with the negative binomial model in all of these cases. Further
details of this anomaly are discussed in the following section.

A  comparison  of  ITL  values  obtained  from  the  cumulative  data  distributions, 
and  ITL's  obtained  with  the  use  of  the  negative  binomial  probability  model,  showed 
good  agreement,  even  for  data  sets  that  deviated  significantly  from  the  model. 
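
The conformity test described in this section can be sketched in Python as follows. This is an illustrative sketch only, not the procedure used in the study: the function names, the pooling of the upper tail into the last cell, and the choice of degrees of freedom (number of cells, minus one, minus two for the fitted m and k) are assumptions. The value of p would then be read from a chi-squared table for the returned degrees of freedom.

```python
import math


def neg_binomial_pmf(x, m, k):
    """Probability of exactly x listener errors for mean m and exponent k."""
    p = k / (m + k)
    q = m / (m + k)
    # log of the generalized binomial coefficient C(x + k - 1, x)
    log_coeff = math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
    return math.exp(log_coeff + k * math.log(p) + x * math.log(q))


def chi_squared_vs_model(observed, m, k):
    """Chi-squared statistic and degrees of freedom for observed error counts.

    observed[x] is the number of scores with x listener errors; the last
    cell is assumed to hold every score with len(observed) - 1 or more errors.
    """
    n = sum(observed)
    probs = [neg_binomial_pmf(x, m, k) for x in range(len(observed) - 1)]
    probs.append(1.0 - sum(probs))          # pooled upper tail
    expected = [n * pr for pr in probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    dof = len(observed) - 1 - 2             # two parameters estimated from the data
    return stat, dof
```

In practice cells with very small expected frequencies would normally be pooled before the statistic is formed; the sketch leaves that choice to the caller.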

2.5  Phenomenon Observed in Connection with Replication of Listening Tests

In the speech intelligibility data analyzed for conformity with the negative
binomial probability model, many of the data sets represented composite results of
presentation of the DRT recordings to the listener crew on two separate occasions
a week or more apart; one set of data involved three presentations at intervals of
a month. The initial chi-squared tests reported in the previous section were performed
on composite data combined over all presentations of a particular DRT
recording. Later in the studies, separate assessments were made for the
data resulting from each presentation to listeners. A pattern was found
in the results: agreement of the intelligibility data with the negative binomial form
was almost always better for the first presentation to listeners than for subsequent
presentations. These findings are presented in Table 7.

The listener tests were performed "blind," that is, the listening crew had
no knowledge as to the identity of any particular speech processor or the processing
conditions. Any given test was always interspersed with other tests in a random
manner. The listeners had much prior experience with these scramblings of
the DRT word lists, and while there was a slight tendency for scores to increase
at the second presentation, in most instances the change was not statistically
significant.

Because of these considerations it is difficult to conceive any explanation as
to why the intelligibility data distributions tended to show increasing deviation
from the negative binomial probability form with the second and subsequent
presentations.


3.  DETERMINATION OF INTELLIGIBILITY THRESHOLD LEVELS (ITL's)

It was established that the distributions of diagnostic intelligibility scores
were sufficiently close to the negative binomial probability model that two alternative
procedures for determining ITL's were available.

3.1  Determination of ITL's from Cumulative Data Distributions

In  this  procedure  the  sample  population  of  diagnostic  intelligibility  scores  is 
rank-ordered,  and  the  ranked  data  are  utilized  in  constructing  a  cumulative  table 
of  scores  in  relation  to  cumulative  percentiles  of  the  data  population.  In  these 
studies  a  program  was  written  for  the  Wang  Model  720  Calculator  to  perform  the 
ranking.  Data  values  were  entered  into  an  auxiliary  memory  via  the  calculator 
keyboard,  and  a  program  subsequently  rearranged  the  table  in  rank-order.  Each 
datum  was  paired  with  a  code  number  to  maintain  identification  of  the  phonetic 
feature  and  speaker  associated  with  each  value. 

A confidence band for the distribution was established by the method of
Kolmogorov, in which a one-sided confidence band is defined by a simple offset of
population percentiles by an appropriate quantile that is a function of the value of
p and the sample size. Kolmogorov's method assumes a random sample X1,
X2, ..., Xn of size n associated with some unknown distribution function F(x).
For the confidence coefficient to be exact, the method requires that the samples
are from a continuous distribution; however, if the random variables are discrete
(as in this case) the confidence band is conservative, that is, the "true" but
unknown confidence coefficient is greater than the stated one.

Kolmogorov's method is most readily used by constructing a tabular listing or
a graphical representation of the empirical distribution function. In a graphical
representation, as in Figures 8 and 9 and the examples in Appendix A, each datum
can be considered as a vertical segment that establishes 100/n percent of the total
population; end points of adjacent segments are joined to form the distribution
function S(x), which is terminated at zero and 100 percent of the sample. A tabular
listing as in Table 4 can be utilized for listing cumulative percentiles representing
the end points or boundaries of each segment (datum).

A confidence band with confidence coefficient $1-\alpha$ is created with the use of
the $1-\alpha$ quantile from a table of the Kolmogorov test statistic (Appendix D). In
determining ITL's, the upper one-sided confidence band is of interest, which is
formed by vertical displacement of the empirical distribution function graph by the
value of $Q_{1-\alpha}$ from a table of the Kolmogorov test statistic. Thus the confidence
boundary is an exact replica of the empirical distribution, offset by an appropriate
amount, and terminated at the values $Q_{1-\alpha}$ and 100 percent, as illustrated in the
examples of Figures 8 and 9. Alternatively, the value of $Q_{1-\alpha}$ is added to (or, for a
lower one-sided limit, subtracted from) the cumulative percentiles of the tabular listing.
If the confidence limit is denoted by U(x),

    $U(x) = S(x) + Q_{1-\alpha}$                                        (1)

forming the boundary of a one-sided $1-\alpha$ confidence band which completely contains
the "true" F(x).

Values for ITL's for various percentages of the data population are determined
in relation to the upper one-sided confidence limit, either graphically as
illustrated in Figure 9, or with more accuracy from a tabular listing as in the
example of Table 4.

Diagnostic intelligibility data populations presented here involved 24 intelligibility
scores (for the phonetic feature states) from each of the speakers in an
intelligibility test. With two or more speakers, the number of samples exceeded
40, and the approximation for calculating the one-sided quantile of the Kolmogorov
test statistic at p = 0.95 was used:

For n > 40,

    $Q_{0.95} = 1.22/\sqrt{n}$
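
The rank-ordering procedure of this section can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the program used in the study: the function names are hypothetical, the one-sided 0.95 quantile is taken as the usual large-sample approximation 1.22/sqrt(n), and the ITL is read as the highest observed score for which the fraction of scores equal to or above it, reduced by the quantile, still reaches the required fraction of the population.

```python
import math


def kolmogorov_q95_one_sided(n):
    """Large-sample (n > 40) approximation to the one-sided 0.95 quantile."""
    return 1.22 / math.sqrt(n)


def empirical_itl(scores, fraction=0.80, q=None):
    """Intelligibility threshold level read from the ranked scores.

    Returns the highest observed score such that, after allowing for the
    Kolmogorov offset q, at least `fraction` of the population is estimated
    to equal or exceed it. Returns None if no observed score qualifies.
    """
    n = len(scores)
    if q is None:
        q = kolmogorov_q95_one_sided(n)
    # Candidate thresholds are the distinct observed scores, highest first.
    for level in sorted(set(scores), reverse=True):
        share_at_or_above = sum(1 for s in scores if s >= level) / n
        if share_at_or_above - q >= fraction:
            return level
    return None
```

For the 216-score populations described above, the offset is about 8.3 percent, so roughly 88 percent of the ranked scores must lie at or above a level before it qualifies as the 80 percent ITL.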

3.2  Determination  of  ITL’s  from  a  Negative  Binomial  Probability  Model 

After a mean DRT score and its variance have been determined for a set of
intelligibility data, conversions to the parameters m and k that characterize a negative
binomial probability model are as follows:

Mean nr. of listener errors:   $m = (100 - D)/3.125$,   where $D$ = mean DRT score

Listener error variance:   $s_e^2 = S_D^2/(3.125)^2$,   where $S_D^2$ = DRT score variance

Estimate of k:   $\hat{k} = m^2/(s_e^2 - m)$

Here is an example of the derivations for the data of Figure 9 and Table 1, from
intelligibility data obtained in testing CVSD at 16 kbps with nine speakers.

Mean DRT score $D = 90.26$;   DRT score variance $S_D^2 = 222.06$;   $n = 216$

Consequently

    $m = (100 - 90.26)/3.125 = 3.12$   (mean listener errors per score)

    $s_e^2 = S_D^2/(3.125)^2 = 222.06/(3.125)^2 = 22.74$   (listener error variance)

leading to the estimate for k:

    $\hat{k} = (3.12)^2/(22.74 - 3.12) = 0.496$


Using these values in calculating probabilities based on a negative binomial
distribution:

    $p = k/(m + k) = 0.137$

    $q = m/(m + k) = 0.863$

    $p^k = 0.373$ = probability of no errors

    $p[x; k, m] = \binom{x + k - 1}{x} p^k q^x$,    x = nr of errors = 0, 1, 2, ...

    $p[x; k, m] = \binom{x - 0.504}{x} (0.3731)(0.863)^x$

resulting in the probability values and distribution function presented in Table 1.

A confidence band for the negative binomial distribution function can be formed
with the same Kolmogorov test statistic and procedure used with the empirical
data distributions; the Kolmogorov test statistic is valid without regard to the form
of the distribution. Thus a confidence band for the negative binomial distribution
function is formed by a displacement of the percentiles associated with the distribution
function by an appropriate value of $Q_{1-\alpha}$ for the Kolmogorov test statistic.

A number of intelligibility data distributions are shown in Appendix A, together
with 95 percent one-sided confidence bands and ITL values.

4.  INTELLIGIBILITY THRESHOLD LEVEL (ITL) RATINGS FOR SOME
VOICE PROCESSORS

Conventional intelligibility scores are compared with ITL values in Tables 8,
9, and 10, and illustrate that differences in intelligibility scores usually become
magnified in differences in ITL's for the same intelligibility data. The data also
illustrate that typically there are large differences between the intelligibility scores
for the voiced and for the unvoiced speech sounds, in comparison with the total
ensemble, and highlight the fact that the greatest potential payoff in improving
speech intelligibility will come through improving the fidelity of the unvoiced
speech events.

Table 8. Intelligibility Threshold Level (ITL) Ratings for LPC Vocoder
Algorithms Operating with Random Bit Errors

Effects of Random Bit Errors on Serial 2400 BPS Linear Predictive Vocoders
(based on six male speakers and two presentations to eight listeners)

                                    80% Intelligibility Threshold Level
    Test                  Mean                  (p = 0.95)*
    Configuration         Score      Voiced     Unvoiced     Total

    LPC Vocoder
      Zero BER            90.9       81.3       68.8         75.0
      1%                  86.0       68.8       53.1         65.6
      3%                  77.1       53.1       40.6         53.1
      5%                  68.6       40.6       31.3         40.6

    Piecewise-LPC Vocoder
      Zero BER            92.4       78.1       71.9         78.1
      1%                  88.0       71.9       65.6         71.9
      3%                  80.6       62.5       43.8         62.5
      5%                  71.2       50.0       31.3         43.8

    * There is 95% confidence that 80% of the specified population of diagnostic
      intelligibility scores (feature scores, by speakers) will equal
      or exceed the stated value.


Table 9. Comparisons of Conventional Intelligibility Scores and ITL's
from Diagnostic Rhyme Test Scores, 6 Male and 3 Female Speakers
and 8 Listeners

                                    80% Intelligibility Threshold Level
                          Mean               (95% Confidence)*
    Processor             Score      Voiced     Unvoiced     Total

    CVSD 9.6 kbps         80.1       62.5       25.0         53.1
    CVSD 16 kbps          90.3       81.3       56.3         71.9
    CVSD 32 kbps          95.0       90.6       71.9         87.5
    Ch. Vocoder 2400      83.0       62.5       37.5         59.4

    * There is 95% confidence that 80% of the population of diagnostic
      intelligibility scores (feature scores, by speakers) will equal or
      exceed the stated value.


Table 10. Comparisons of ITL's Derived from Empirical Data Distributions
and Obtained with the Negative Binomial Probability Model

Comparison of 80% ITL's (p = 0.95)
Values from data distribution vs. Negative Binomial model
(Data from 6 M. & 3 Fem. Speakers, 8 Listeners)

                        Voiced Features       Unvoiced              Total
    Processor           Data    (Model)       Data    (Model)       Data    (Model)

    CVSD 9.6 kbps       62.5    (65.6)        25.0    (28.1)        53.1    (53.1)
    CVSD 16 kbps        81.3    (81.3)        56.3    (53.1)        71.9    (75.0)
    CVSD 32 kbps        90.6    (90.6)        71.9    (71.9)        87.5    (87.5)
    Ch. Voc. 2400       62.5    (65.6)        37.5    (40.6)        59.4    (62.5)
    Ch. Voc. 2400       65.6    (65.6)        56.3    (56.3)        65.6    (65.6)
    APC-4 6400          68.8    (68.8)        40.6    (40.6)        59.4    (62.5)

In  Table  8,  two  linear-predictive  (LPC)  vocoder  algorithms  are  compared  in 
terms  of  their  intelligibility  performance  in  the  presence  of  random  bit  errors  as 
could  be  caused  by  interference  or  low-grade  transmission  channels,  when  no 
measures  are  provided  for  error  protection. 

Table 9 compares conventional intelligibility scores and ITL's for continuously
variable slope delta modulation (CVSD) at three data rates, and a conventional
channel vocoder.

Table 10 compares the ITL's obtained from the empirical data distribution
with the values obtained with the use of the negative binomial probability model.
The greatest discrepancy in ITL values was a one-quantum change in the value
(3.125 points); over half of these eighteen comparisons gave perfect agreement
in ITL values assessed by the two methods.
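
The tally behind this statement can be reproduced from the Table 10 entries, as in the short Python sketch below. The data structure and function name are illustrative only; because the table lists ITL's rounded to one decimal, each value is first converted back to a whole number of 3.125-point quanta before the two methods are compared.

```python
QUANTUM = 3.125  # one listener error expressed in DRT score points

# (data ITL, model ITL) pairs transcribed from Table 10, one pair per processor row
TABLE_10 = {
    "voiced": [(62.5, 65.6), (81.3, 81.3), (90.6, 90.6),
               (62.5, 65.6), (65.6, 65.6), (68.8, 68.8)],
    "unvoiced": [(25.0, 28.1), (56.3, 53.1), (71.9, 71.9),
                 (37.5, 40.6), (56.3, 56.3), (40.6, 40.6)],
    "total": [(53.1, 53.1), (71.9, 75.0), (87.5, 87.5),
              (59.4, 62.5), (65.6, 65.6), (59.4, 62.5)],
}


def quantum_steps(data_itl, model_itl):
    """Difference between two tabulated ITL's in whole quanta."""
    return abs(round(data_itl / QUANTUM) - round(model_itl / QUANTUM))


steps = [quantum_steps(d, mdl) for pairs in TABLE_10.values() for d, mdl in pairs]
exact_matches = steps.count(0)
largest_step = max(steps)
print(f"{exact_matches} of {len(steps)} comparisons agree exactly; "
      f"largest difference: {largest_step} quantum step(s)")
```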


5.  CONCLUSIONS  AND  RECOMMENDATIONS 

Intelligibility scores for voice processors have been found to be typically
characterized by highly significant differences among speakers, as well as highly
significant differences among scores for the various phonetic features.

Distributions of intelligibility scores are not normal, but highly skewed.

A  negative  binomial  probability  distribution  was  found  to  give  good  agreement 
with  empirical  intelligibility  data  distributions. 

A  new  performance  rating  for  voice  communications  devices,  termed  an 
Intelligibility  Threshold  Level  (ITL),  was  conceived  as  a  means  of  taking  these 
findings  into  consideration  in  establishing  a  measure  of  performance  that  is  an 
estimate  of  an  intelligibility  value  that  the  majority  (rather  than  the  simple  average) 
of  intelligibility  scores  for  a  voice  processor  will  equal  or  exceed,  at  a  specified 
confidence  level  established  in  relation  to  the  sample  size  used  in  obtaining  the 
rating. 

It is proposed that an ITL rating is a more meaningful assessment of the degree
of risk involved in misunderstanding voice messages, or causing time to be
lost in requiring messages to be repeated.

It was shown that ITL's can be determined by two alternative methods: by
rank-ordering the intelligibility scores for a voice processor and constructing the
cumulative distribution of data and its confidence band, or by using a negative
binomial probability model for the distribution.

Chi-squared tests indicated that in most cases the negative binomial probability
model gave a reasonable approximation to the data distribution.

Intelligibility Threshold Levels (ITL's) estimated with the negative binomial
model differed by at most one quantum value (3.125) from ITL's determined from
the empirical distributions.

It is recommended that future speech intelligibility tests and evaluations of
digital voice communications processors and systems include a determination of
the 80 percent ITL's at 0.95 probability, that is, a determination of the intelligibility
level that, with 95 percent probability, 80 percent of the population of
intelligibility scores (for individual speakers and phonetic features) will equal or
exceed.

Appendix A

Voice Processor Intelligibility Data Distributions and
Intelligibility Threshold Levels (ITL's)



[Figures A1 through A6: each plot shows the cumulative percent of ranked data
against the ranked DRT intelligibility score (individual speakers, by features),
with a second horizontal scale giving the number of listener errors, 32 to 0.]

Figure A1.a. CVSD at 16 kbps: Voiced Intelligibility Features

Figure A1.b. CVSD at 16 kbps: Unvoiced Intelligibility Features

Figure A2.a. CVSD at 32 kbps: Voiced Intelligibility Features

Figure A2.b. CVSD at 32 kbps: Unvoiced Intelligibility Features

Figure A3. LPC Vocoder at 2400 BPS: Three Male Speakers (LL, RH, CH)

Figure A4.b. LPC Vocoder at 2400 BPS with 5 Percent Random Bit Errors
(Six Male Speakers)

Figure A5.a. LPC Vocoder at 2400 BPS: Twelve Male Speakers

Figure A5.b. LPC Vocoder at 2400 BPS: Voiced Intelligibility Features
(Twelve Male Speakers)

Figure A5.c. LPC Vocoder at 2400 BPS: Unvoiced Intelligibility Features
(Twelve Male Speakers)

Figure A6.a. TRIVOC Vocoder at 2400 BPS: Six Male Speakers
(plot annotation: "Threshold for upper 80%")

Figure A6.b. TRIVOC Vocoder at 2400 BPS: Voiced Intelligibility Features
(Six Male Speakers)


Appendix  B 

Wang  Model  720  Calculator  Program  for  Negative 
Binomial  Probability  Distribution 


This  program  sequence  is  designed  to  operate  in  conjunction 
with  some  utility  subroutines  for  storing  and  retrieving  the 
contents  of  the  calculator  x&y  registers,  and  for  calculating 
and  executing  plots  in  conjunction  with  the  Wang  Model  702 
Printer/Plotter. 

To initialize, values of m, s, n, and the constant for
converting between listener errors and DRT score are
entered from the keyboard as follows:

m into register 006

s (standard deviation in listener errors) into reg 044

n into register 002

c (for converting error count to DRT score) into reg 001

Load a zero into reg 032 to obtain a listing.

Load a 2 into reg 032 to obtain a plot.

Position the Model 702 Plotter Printer at the top of
the page for a listing; at the origin for a plot.

Execute "Search 1515".
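
For readers without access to the calculator, the quantities named in the program's write steps (number of listener errors, equivalent DRT score, p(m, x, k), negative binomial frequency, cumulative p, and cumulative frequency) can be produced with the Python sketch below. It is a restatement under assumptions, not a transcription of the keystroke listing: the function name and argument order are hypothetical, k is estimated as m^2/(s^2 - m) as in Section 3.2, and the probabilities are accumulated in logarithms, which the listing's Ln and e^x steps suggest but do not confirm.

```python
import math


def negative_binomial_table(m, s, n, c=3.125, x_max=32):
    """Rows of (errors, DRT score, p, frequency, cumulative p, cumulative frequency).

    m: mean listener errors, s: standard deviation in listener errors,
    n: number of scores, c: DRT score points per listener error.
    """
    k = m * m / (s * s - m)                # estimate of k from mean and variance
    ln_p = math.log(k / (m + k))
    ln_q = math.log(m / (m + k))
    rows = []
    cumulative = 0.0
    ln_coeff = 0.0                         # log of C(x + k - 1, x); zero for x = 0
    for x in range(x_max + 1):
        if x > 0:
            ln_coeff += math.log((x + k - 1) / x)
        prob = math.exp(ln_coeff + k * ln_p + x * ln_q)
        cumulative += prob
        rows.append((x, 100.0 - c * x, prob, n * prob, cumulative, n * cumulative))
    return rows
```

With m = 3.116, s equal to the square root of 22.74, and n = 216, the first row corresponds to zero errors (a score of 100) with a probability close to the 0.373 quoted in Section 3.2.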
Wang Calculator Program - continued.


0408 

MARK  Start  Neg. 

0415 

recall  y 

1515 

1515  Binomial  Routine 

0003 

003 

ln  (p) 

0700 

0 

0405 

recall  dir 

0404 

store  dir 

0001 

001 

k* 

0008 

008 

0602 

multiply 

0405 

recall  dir 

0414 

store  y 

0404 

044  s 

0011 

011

0713 

square 

0405 

recall  dir 

0604

t 

0012 

012 

Xi 

0405 

0006

recall dir

006  m

0412 

Write  A 

Skip  if 

0711 

Ch  Sign 

Xi  =  0 

0601 

■ 

0407 

Search 

0713 

square 

1411 

1411 

0606 

exch.  x&y 

0415 

Recall  Y 

0603 

divide 

0004 

004 

Ln  (q) 

0605 

0602

multiply 

0607 

|x| 

0605

1 

0604 

t 

store  Y 

0400 

+  direct 

0414 

0011 

011

0001 

001  k* 

0408 

MARK 

0405 

recall  dir 

1415 

0415 

1415 

recall  y 

0006 

006  m 

0001 

001 

k* 

0606 

exch.  x&y 

0405 

recall  dir 

0600 

+ 

0012 

012 

Xi 

0606 

exch.  x&y 

0600 

+ 

0603 

divide 

0701 

1 

0605 

i 

0601 

m 

0611 

Ln  x 

0605 

1 

0404 

store dir

0611 

Ln  x 

0003 

0701 

003  Ln  (p) 

1 

0400 

+  direct 

0606 

exch.  x&y 

0011 

011 

0601 

_ 

0405 

recall  dir 

0605 

t 

0012 

012 

Xi 

0611

Ln  x 

0611 

Ln  x 

0404 

store  dir 

0401 

-  direct 

0004 

004  Ln  (q) 

0011 

011

0408 

MARK 

0701 

1 

1412 

1412 

0401 

-  direct 

0405 

recall  dir 

0012 

012  deer.  Xi 

0009 

009  Xi 

0415 

recall  y 

0404 

store  dir 

0012 

012 

Xi 

0012 

012 

0701 

1 

0508 

Skip  if  y  <  x 

0405 

rec  direct 

0407 

Search 

0011 

011 

1415 

1415 

0614 

ex 

0405 

rec  direct 

0411 

Write 

0011 

011

0105 

(1.5)  p(m,x,k) 

0814 

ex 

0604 

t 

0400

+  direct 

0405 

rec  direct 

0008 

008 

0002 

002  N 

0415 

rec  Y 

0602 

multiply 

0008 

008 

0606 

exch  x&y 

0405 

rec  direct 

0411 

Write 

0009 

009 

0403 

(4.3)  Neg. Binomial  Frequency 

0004 

SRistore  x&y)  0004 

0405 

rec  direct 

0415 

rec  Y 

0008 

008 

0302 

032 

0411 

Write 

0702 

2 

0105 

(1.5)  Cumulative  p 

0508 

Skip  if  Y  <  X 

0602 

multiply 

0100 

Execute SR0100 (calc. plot)

0605 

1 

0114 

Execute  SR0114(execute  plot) 

0411 

Write 

0415 

rec  Y 

0403 

(4.3)  Cumulative  Frequency 

0302 

032 

0408 

Mark 

0702 

2 

1514 

1514 

0508 

Skip  if  Y<X 

0701 

1 

0407 

Search  (Jump  if  flag  <  2) 

0400 

+  direct 

1514 

1514 

0009 

009  incr.  x 

0015 

Execute  SR0015  (CR/LF) 

0415 

rec  Y 

0005 

Execute SR0005 (recall x&y)

0203 

023  Xmax 

0604 

t 

0600 

+ 

0405 

rec  direct 

0600 

+ 

0000 

000  (score  incr.) 

0600 

+ 

0602 

multiply 

0600 

+  (Xmax  +  4) 

0701 

l 

0405 

rec  direct 

0700 

0 

0009 

009  xi 

0700 

0 

0508 

Skip  if  Y  <  X 

0606 

exch  x&y 

0407 

Search  (loop  for 

0601 

subtract 

1412 

1412  next  x^) 

0605 

i 

0515 

STOP 

0411 

Write 

0302 

(3.2)  (Equivalent  DRT  Score) 

0005 

Execute SR0005 (recall x&y)

0411 

Write 

0200 

(2.0)  (Nr  of  listener  errors) 
Appendix C

The Intelligibility Threshold Level (ITL):
A New Approach for Evaluating Performance of
Digital Speech Communications Processors

Caldwell P. Smith


Reprinted from ASA*50 Speech Communication Preprint Experiment,
Acoustical Society of America, June 1979

Intelligibility performance of voice processors has been typically specified in
average intelligibility scores: values presumably equaled or exceeded by half the
underlying populations of scores, assuming a normal distribution. However, extensive
multi-speaker testing of a wide variety of processors over the past decade has shown
conclusively that (1) populations of scores for processors typically deviate significantly
from normal; (2) mean scores of individual talkers in multi-speaker tests typically
show highly significant differences (α = .001), and (3) significant differences among
the mean scores for phonetic features are also typical. To further an implied objective
of intelligibility testing, estimation of a level equaled or exceeded by the majority
of scores (for example, 80% of a population of scores for talkers and phonetic features),
a new approach has been established for evaluation, in which intelligibility data is
analyzed to estimate intelligibility threshold levels (ITL's): levels equaled or exceeded
by specified fractions of populations of scores, at a specified confidence level.

The method, based on non-parametric statistics of the Kolmogorov-Smirnov type, involves
rank-ordering a population of scores and constructing a cumulative distribution and its
confidence band, from which ITL's can be readily assessed. Intelligibility data for
various processors has typically shown larger differences in ITL's than occur with
mean intelligibility scores.

INTRODUCTION. 

Diagnostic intelligibility testing (Voiers, 1973; 1977) has been applied extensively to test
and evaluation of a variety of speech communications processors and systems over the past decade;
numerous test results have been published in the literature (for example, Voiers and Smith, 1972;
Smith, 1977; 1979). Testing has usually served multiple objectives of providing a basis for
comparing different speech processor algorithms or hardware, and guiding research to "fine tune"
algorithms to obtain superior intelligibility or correct deficiencies, but also for estimating
intelligibility predicted for the processor when used in a "real world" environment for support
of voice communications for some population of talkers and listeners, presumably not too different
from those used in conducting the tests. The average intelligibility scores customarily
cited for voice systems carry the implication of representing values that would be equaled or
exceeded by 50% of an underlying aggregate of scores for individual talkers, listeners, and
phonetic features, a population sampled in the process of intelligibility testing, presumably
normally distributed and representative of a "real world" communications environment.

RATIONALE  FOR  THE  INTELLIGIBILITY  THRESHOLD  LEVEL  (ID  PERFORMANCE  RATING 

Even if populations of intelligibility scores obtained in multi-speaker tests were normally
distributed (and there is considerable evidence that this is customarily not the case; Smith,
1979), it would seem appropriate to reassess the practice of specifying performance in terms of
average intelligibility scores. Instead, intelligibility levels estimating the values attained
or exceeded by a majority in the populations of scores, for example 80% of the scores, at a
specified confidence level, would seem to be more meaningful and relevant performance ratings,
especially so for assessing voice systems required to support critical communications involving
brief, terse messages and where misunderstandings, or time lost in requiring messages to be
repeated, could result in severe cost penalties. From this perspective, a performance rating in
the form of an intelligibility threshold level (ITL) is proposed: an experimentally determined
intelligibility level (based on testing with appropriate talkers, listeners, and test items, such
as current multi-speaker versions of the Diagnostic Rhyme Test) specifying a forecast of the
intelligibility level that will be equaled or exceeded by a specified percentage of a population
of intelligibility scores, at a stated confidence level.

EXAMPLES OF INTELLIGIBILITY THRESHOLD LEVELS (ITL's)

Some examples of ITL's determined from distributions of diagnostic intelligibility scores of
several well-known digital speech processors follow. Each was determined by utilizing the 24 basic
diagnostic scores, i.e. the four scores ascertained for each of the six primary consonantal
features tested with the Diagnostic Rhyme Test, forming a composite data table composed of feature
scores of all speakers in the test. The composite table was then rank-ordered and used to
construct a cumulative distribution. Except for the fact that each datum (score) represented an
average of the responses of eight listeners, there was no averaging: for example, separate scores
were included for voicing present (frictional), voicing present (non-frictional), voicing absent
(frictional) and voicing absent (non-frictional) for each speaker, etc. The details of phonetic
feature scores that typically reveal idiosyncrasies and deficiencies peculiar to a voice
processor algorithm are retained in the cumulative plots of scores. While the distributions do not
reveal specific identities of features and speakers involved in particular "bad" scores, that
information is readily available from the conventional listing of diagnostic scores when needed
for the purpose of diagnosing specific deficiencies.
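The pooling and ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy input are hypothetical, and the cumulative percentage is expressed as "percent of scores at or above a level" because that is the quantity an ITL refers to; the original plots may have been drawn with a different convention.

```python
def pool_scores(scores_by_speaker):
    """Combine every speaker's feature-state scores into one list, with no averaging."""
    return [score for scores in scores_by_speaker.values() for score in scores]


def cumulative_table(scores):
    """Return (score, percent of scores at or above it) for each distinct score, ascending."""
    n = len(scores)
    return [
        (level, 100.0 * sum(s >= level for s in scores) / n)
        for level in sorted(set(scores))
    ]


# Hypothetical toy input: two talkers with two scores each. A real DRT run
# would supply 24 feature-state scores per talker.
example = {"talker_1": [100.0, 87.5], "talker_2": [75.0, 100.0]}
table = cumulative_table(pool_scores(example))
```

Each datum contributes 100/n percent to the table, so "bad" scores show up directly as the low-score tail of the ranked list.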


[Figure 1: distribution of phonetic feature scores for 16 kbps CVSD (6 male and 3 female speakers), shown with a one-sided 95% confidence limit and a normal ogive (μ = 90.3, σ = 14.9). Horizontal axis: ranked DRT intelligibility score (individual speakers, by features).]

Fig. 1. Determination of ITL for 16 kbps CVSD.

The cumulative plots summarize in a compact and efficient way all of the information contained
in the detailed scores. They can facilitate formal or informal assessment of the proportion of
"bad" scores, something not easily determined by inspecting the listing of scores. Confidence
bands for the distributions were constructed by the method of Kolmogorov (Conover, 1971), these
providing the basis for ascertaining ITL's. Cumulative distributions also provide a basis for
other non-parametric tests for comparisons with other standard or experimental discrete or
continuous forms. One such test is the Lilliefors test for conformity with a normal distribution
(Lilliefors, 1967). Application of this test to distributions of diagnostic intelligibility
scores, including single and multi-speaker data for a variety of processors, has in all cases
indicated that the null hypothesis (for conformity with a normal distribution) should be
rejected. Distributions of total intelligibility scores (individual listeners) have been found
approximately normal for a single speaker, but deviating significantly with multiple-speaker data.
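For readers who want to try the normality check on their own score lists, the following is a minimal sketch of a Lilliefors-type test. It is not taken from the report: the function names are hypothetical, and the critical value 0.886/√n is the commonly tabulated large-sample approximation for the 5% level (n greater than about 30), assumed here rather than quoted from the text.

```python
import math
from statistics import NormalDist, mean, stdev


def lilliefors_statistic(scores):
    """Largest gap between the empirical CDF and a normal CDF fitted to the sample."""
    n = len(scores)
    fitted = NormalDist(mean(scores), stdev(scores))  # parameters estimated from the data
    gap = 0.0
    for i, x in enumerate(sorted(scores), start=1):
        f = fitted.cdf(x)
        # The empirical CDF steps from (i - 1)/n to i/n at the i-th ordered score.
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


def rejects_normality(scores, coeff=0.886):
    """Compare with the assumed large-sample 5% critical value coeff / sqrt(n)."""
    return lilliefors_statistic(scores) > coeff / math.sqrt(len(scores))
```

A sample with most scores piled near 100 and a tail of low scores, the shape described for the feature-score distributions, produces a large gap and is rejected, while a roughly symmetric bell-shaped sample is not.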

Fig. 1 presents the distribution of scores from evaluation of continuous variable-slope delta
modulation (CVSD) operating at 16 kilobits per second. A one-sided 95% confidence band and the
normal ogive for the mean and variance found in this data set, from tests with nine talkers (6 male
and 3 female), are shown. The ITL estimate from these data suggests that, at the 95% confidence
level, 80% of the scores in populations for which the data are a random sample will equal or exceed
71.9 (see Table I). The 80% value in the actual data distribution was 81.3.


ALTERNATIVE DATA BASES FOR ITL's

Assuming that multi-speaker diagnostic intelligibility data have been established by formal
intelligibility testing, several alternative methods can be considered for selecting the data
base used in determining ITL's, differing in the basis of selecting an aggregate of scores for
the assessment. One procedure has already been described above. Some other possibilities
are (2) separating the population of scores previously described into sub-groups consisting of
voiced and unvoiced scores respectively, and determining ITL's separately for the two categories;
(3) averaging across phonetic feature scores obtained with each listener, thus creating a
population of total intelligibility scores, one for each speaker/listener combination; and (4)
utilizing the separate scores obtained for every speaker/feature/listener combination.

Application of method (2) is illustrated in Fig. 2. The diagnostic scores for 16 kbps CVSD
of Fig. 1 were separated into groups representing voiced and unvoiced features; separate rankings
and ITL's were established for the two groups. The data represent a trend found in scores for
voice processors, in which values of ITL's for voiced features are higher, and those for unvoiced
features lower, than the ITL values for the total population of scores. Further comparisons are
shown in Table I, which presents ITL ratings from total aggregates of scores (as described in the
first method) and separate ITL's for the data separated into populations of voiced and unvoiced
feature scores. The comparisons suggest that the greatest potential for improving intelligibility
will lie in improving the modeling of unvoiced speech events.


Method (3), using distributions of total intelligibility scores (for individual listeners
and speakers), was found to result in distributions approximately normal with a single speaker,
but deviating significantly from normality with multiple speakers (on the basis of Lilliefors
test results). However, in all cases these distributions were much less skewed than distributions
of diagnostic scores for phonetic features. Method (3) reveals variations in listener performance
not revealed by the first method described; however, listener variations have been found to be
much smaller than variations due to phonetic features or due to speakers. A major limitation of
method (3) is the failure to reveal significant deficiencies among the feature scores. For this
reason, evaluation by this grouping of data is considered to have limited value.

Method (4) would offer composite information about talkers, phonetic features, and listeners.
However, it has a major shortcoming: the 192-word Diagnostic Rhyme Test includes only eight
tokens for each of the 24 feature states. Consequently, method (4) would result in distributions
with extremely gross quantization of the scale (nine possible values for the scores), resulting
in inadequate resolution.
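The quantization argument can be checked with a short calculation. The sketch below assumes the customary chance-corrected scoring for two-choice rhyme tests, 100 × (right − wrong) / total, which is not stated in this section; the function names are hypothetical. The count of nine values for eight tokens does not depend on that assumption, only the spacing between values does.

```python
def chance_corrected_score(right, wrong):
    """Percent score adjusted for guessing on two-choice items: 100 * (R - W) / T."""
    return 100.0 * (right - wrong) / (right + wrong)


def possible_scores(items):
    """Every score obtainable from `items` two-choice responses."""
    return [chance_corrected_score(r, items - r) for r in range(items + 1)]


one_listener = possible_scores(8)         # 8 tokens per feature state
eight_listeners = possible_scores(8 * 8)  # the same tokens pooled over 8 listeners
```

Under this assumption a single listener's feature-state score can take only nine values spaced 25 points apart, whereas a score pooled over eight listeners moves in steps of 3.125 points, which is why the listener-averaged scores of the first method give far finer resolution.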


[Figure 2: separate cumulative distributions of voiced and unvoiced feature scores. Horizontal axis: ranked DRT intelligibility score (individual speakers, by features).]

Fig. 2. Separate distributions and 80% ITL's for voiced and unvoiced scores from the data of
Fig. 1. The data suggest that improvements in intelligibility could best be found through
improved modeling of unvoiced consonant events.


Steps in determining ITL's (using the first method described) are as follows:

1. Rank-order the population of intelligibility scores comprised of the 24 feature-state (or
"sub-feature") scores of each talker, combined into an aggregate data population.
2. Using the ranked table, construct the cumulative distribution of scores. For this purpose,
each datum is interpreted as a segment of (1/n × 100)%.
3. Construct a confidence band, using an appropriate quantile from a table of values of the
Kolmogorov test statistic (Conover, 1971) or the value Q = 1.22/√n (for a one-sided band at
p = .95, and n > 40). (Note that with 24 feature scores for each of six speakers, Q = .0983,
or 9.83%.) In reference to the figures, the value of Q designates the amount of vertical
displacement of the data distribution required to define the specified confidence band.
4. To obtain desired ITL's, read from the confidence band profile the corresponding
intelligibility level. This can be done graphically or, for greater accuracy, from the listing of
ranked scores and associated cumulative percentages (as done with the examples of ITL's
presented here).
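The four steps above can be sketched as a small Python routine. This is one reading of the procedure, not the author's own program: it assumes the confidence band is formed by lowering the "percent of scores at or above a level" curve by Q, and that the ITL is the highest observed score whose lowered percentage still meets the target proportion. The names `kolmogorov_q` and `itl` are hypothetical, and a tabulated quantile can be supplied through `q` in place of the large-sample formula.

```python
import math


def kolmogorov_q(n, coeff=1.22):
    """Large-sample (n > 40) displacement for a one-sided band at p = .95."""
    return coeff / math.sqrt(n)


def itl(scores, proportion=0.80, q=None):
    """Highest observed score equaled or exceeded by `proportion` of the
    population, after displacing the empirical distribution by q.

    Returns None when the band is too wide for any observed score to qualify.
    """
    n = len(scores)
    if q is None:
        q = kolmogorov_q(n)
    # Steps 1 and 2: walk the ranked distinct scores from best to worst.
    for level in sorted(set(scores), reverse=True):
        fraction_at_or_above = sum(s >= level for s in scores) / n
        # Steps 3 and 4: apply the displacement and read off the level.
        if fraction_at_or_above - q >= proportion:
            return level
    return None
```

Calling `itl(scores, q=0.0)` gives the corresponding level in the data themselves with no confidence allowance, which makes it easy to see how far the confidence band pulls the rating down for a given sample size.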


PROPOSED ADVANTAGES OF THE INTELLIGIBILITY THRESHOLD LEVEL (ITL) RATING.


It is proposed that ITL ratings have several advantages over average intelligibility scores
for rating intelligibility performance of voice communications processors:

1. An ITL rating assesses intelligibility performance attained for the majority of scores rather
than the average score. The rating provides information as to whether a voice processor caused a
significant proportion of "bad" scores for any combinations of speaker and phonetic features.
2. A confidence level is defined for the ITL rating.
3. The ITL rating, based on non-parametric properties of the distribution of scores, is valid
without regard to the form of the distribution of scores, and is not affected by departure from
normality.
4. The rating is inherently compensated for the number of speakers and/or the number of
replications of a test.
5. ITL ratings for various processors have been found to show greater differences among
processors than revealed in average scores.
6. The ITL does not require any new testing method; given detailed data, it is applicable for
evaluating intelligibility data not only for the Diagnostic Rhyme Test or DRT, but also for the
Modified Rhyme Test or MRT (House et al., 1965) or the Consonant Recognition Test or CRT (Preusse,
1959). However, details of diagnostic scores by speakers and features are known only for the
Diagnostic Rhyme Test.
TABLE I. COMPARISONS OF INTELLIGIBILITY THRESHOLD LEVELS (ITL's)*
from Diagnostic Rhyme Test Scores, 6 Male & 3 Female Speakers and 8 Listeners

                                 80% Intelligibility Threshold Level
                                          (95% Confidence)
PROCESSOR          Mean Score    Voiced    Unvoiced    Total
CVSD 9.6 kbps         80.1        62.3       25.0       33.1
CVSD 16 kbps          90.3        81.3       36.3       71.9
CVSD 32 kbps          95.0        90.6       71.9       87.5
Ch vocoder 2400       83.0        62.3       37.3       39.4

* There is 95% confidence that 80% of the population of diagnostic intelligibility scores
(feature scores, by speakers) will equal or exceed the stated value.


It is hypothesized that ITL ratings may be more closely correlated with user acceptance of
digital speech processors than are average intelligibility scores; however, no data have been
available to permit a test of this hypothesis.

CONCLUSIONS. 

The relatively new techniques of multi-speaker diagnostic intelligibility testing have
tended to overwhelm the evaluator with the mass of information contained in typical
intelligibility test results. Perhaps because of the difficulties of evaluating and interpreting
fine details of performance, there has been a tendency to reduce results to single numbers: the
average intelligibility scores, values that provide no information as to variations among
individual speakers and among scores for phonetic features, even though the salient differences
in various speech processor algorithms are usually revealed in these details. Preparation of
cumulative distributions of scores can provide a means for summarizing entire populations of
scores in a meaningful but compact form that contains the significant variations and highlights
deficiencies. Preparation of confidence bands for distributions can permit forecasts of
intelligibility threshold levels (ITL's) for selected proportions of scores at specified
confidence levels.

Studies  of  ITL  ratings  of  intelligibility  performance  obtained  with  various  speech  processors 
operating  under  a  variety  of  conditions  should  provide  guidance  for  determining  minimum  ITL 
standards  that  would  be  appropriate  performance  criteria  for  various  applications  in  communication. 
Appendix D

Table of the Kolmogorov Test Statistic